Efficient tests of association for quantitative traits and affected-unaffected studies using pooled DNA

Related Application 

This application claims priority to U.S. Ser. No. 60/238,381, filed October 6, 2000
[21402-1 39], which is incorporated herein by reference in its entirety.

Background of the Invention

The complex diseases that present the greatest challenge to modern medicine,
including cancer, cardiovascular disease, and metabolic disorders, arise through the interplay
of numerous genetic and environmental factors. One of the primary goals of the Human
Genome Project is to assist in the risk assessment, prevention, detection, and treatment of these
complex disorders by identifying the genetic components. Disentangling the genetic and
environmental factors requires carefully designed studies. One approach is to study highly
homogeneous populations (Nillson and Rose 1999; Rabinow, 1999; Frank 2000). A recognized
drawback of this approach, however, is that disease-associated markers or causative alleles
found in an isolated population might not be relevant for a larger population. An attractive
alternative is to use well-matched affected-unaffected studies of a more diverse population.

Even with a well-matched sample set, the genetic factors contributing to an aberrant
phenotype may be difficult to determine. Traditional linkage analysis methods identify
physical regions of DNA whose inheritance pattern correlates with the inheritance of a
particular trait (Liu 1997; Sham 1997; Ott 1999). These regions may contain millions of
nucleotides and tens to hundreds of genes, and identifying the causative mutation or a tightly
linked marker is still a challenge. A more recent approach is to use a sufficiently dense
marker set to identify causative changes directly. Single nucleotide polymorphisms, or SNPs,
can provide such a marker set (Cargill et al. 1999). These are typically bi-allelic markers with
linkage disequilibrium extending an estimated 10,000 to 100,000 nucleotides in heterogeneous
human populations (Kruglyak 1999; Collins et al. 2000). Tens to hundreds of thousands of
these closely spaced markers are required for a complete scan of the 3 billion nucleotides in
the human genome. Because each SNP constitutes a separate test, the significance threshold
must be adjusted for multiple hypotheses (p-value ~ $10^{-8}$) to identify statistically meaningful
associations. Consequently, hundreds to thousands of individuals are required for association
studies (Risch and Merikangas 1996).

The most powerful tests of association require that each individual be genotyped for 
every marker (Fulker et al. 1995, Kruglyak and Lander 1995, Abecasis et al. 2000, Cardon 
2000) and remain far too costly for all but testing candidate genes. An alternative that 
circumvents the need for individual genotypes, related to previous DNA pooling methods for 
determination of linkage between a molecular marker and a quantitative trait locus (Darvasi 
and Soller 1994), is to determine allele frequencies for sub-populations pooled on the basis of 
a qualitative phenotype. Populations of unrelated individuals, separated into affected and 
unaffected pools, have greater power than related populations. Limited guidance has been 
provided, however, regarding the sample size requirement of tests using pooled DNA relative 
to individual genotyping, or the efficiency of tests based on a quantitative phenotype relative 
to an affected/unaffected design. 

The phenotypes relevant for complex disease are often quantitative, however, and 
converting a quantitative score to a qualitative classification represents a loss of information 
that can reduce the power of an association study. The location of the dividing line for 
affected versus unaffected classification, for example, can affect the power to detect 
association. Furthermore, pooling designs based on a comparison of numerical scores are not 
even possible with a qualitative classification scheme. These distinctions can be especially 
relevant when populations contain related individuals and qualitative tests have a disadvantage 
(Risch and Teng 1998). 

Risk assessment to determine whether a person suffers from or is at risk of developing a
complex disorder often requires measuring an underlying quantitative phenotype. Association
studies in unrelated populations can implicate genetic factors contributing to disease risk,
and experiments using pooled DNA provide a less costly but necessarily less powerful
alternative to methods based on individual genotyping. Association
studies require markers in linkage disequilibrium with causative genetic polymorphisms.
Although the sample sizes required for pooling and individual genotyping studies have been 
compared in certain instances, general results have not been reported in the context of 
association studies, nor have there been clear comparisons of pooling based on quantitative 
and qualitative (affected/unaffected) phenotypes. Association tests of DNA pooled on the 
basis of a quantitative phenotype are analogous to selection experiments for quantitative trait 
locus (QTL) mapping. For a QTL with a weak effect on a phenotype, the mean phenotypic 
value of individuals selected to exceed a threshold is proportional to the mean allele 
enrichment. This suggests that genotyping of a certain percentage of the upper and lower 
phenotypic values of an unrelated population is useful to estimate the effect of a marker on a 
quantitative phenotype, such as in pooling studies. There is a need in the art to examine the 
sample size requirements of association tests for quantitative traits using pooled DNA. 

Summary of the Invention 

The present invention is based, in part, on the discovery of methods to detect an 
association in a population of individuals between a genetic locus and a quantitative 
phenotype, where two or more alleles occur at a given genetic locus, and the phenotype is 
expressed using a numerical phenotypic value whose range falls within a first numerical limit 
and a second numerical limit. These limits are used to provide for subpopulations that consist 
of upper and lower pools. 

In some embodiments, the population of individuals includes individuals who may be 
classified into classes. In certain aspects of the invention, these classes are based on age, 
gender, race, or ethnic origin. In other aspects, some or all members of a class are included in 
the pools. 

In various embodiments, these numerical limits are chosen so that the upper pool 
includes the highest 19%, 27%, or 37% of the population. In other embodiments, the 
numerical limits are chosen such that the lower pool includes the lowest 19%, 27%, or 37% of 
the population. 

In some embodiments, the upper and lower pools have the same number of individuals. 

In one embodiment of the invention, the numerical limits are chosen to correlate with 
error of measurement determinations. In some embodiments, the numerical limit on the error 
of measurement is about 0.04 or about 0.01. 



In some embodiments, methods to detect an association in a population of individuals 
between a genetic locus and a quantitative phenotype are useful to determine the genetic basis 
of disease predisposition. 

In other embodiments, the genetic locus analyzed contains a single nucleotide 
polymorphism. 

In the present invention, the population of individuals can include unrelated 
individuals. 

Unless otherwise defined, all technical and scientific terms used herein have the same 
meaning as commonly understood by one of ordinary skill in the art to which this invention 
belongs. Although methods and materials similar or equivalent to those described herein can 
be used in the practice or testing of the present invention, suitable methods and materials are 
described below. All publications, patent applications, patents, and other references 
mentioned herein are incorporated by reference in their entirety. In the case of conflict, the 
present specification, including definitions, will control. In addition, the materials, methods, 
and examples are illustrative only and not intended to be limiting. 

Other features and advantages of the invention will be apparent from the following
detailed description and claims.

Brief Description of the Figures 

Fig. 1. The sample size required to achieve a type I error rate of $5\times10^{-8}$ and a power of 0.8 for
a QTL for a complex trait is shown for pooled DNA designs relative to individual genotyping.
The ratio $N_{cc}/N_{indiv}$ for affected-unaffected pools (dashed line) is shown as a function of the
disease incidence $r$, while the ratio $N_{tail}/N_{indiv}$ (solid line) is shown as a function of the fraction
$\rho$ of the total population selected for each pool. The optimum value of $N_{tail}/N_{indiv}$ is 1.24,
occurring at $\rho$ = 27% selected for each pool.



Fig. 2a. Exact numerical results for the sample size $N$ required to achieve a type I error rate of
$5\times10^{-8}$ with a power of 0.8 are shown for affected-unaffected pools (dashed line) and tail pools
(solid line) as a function of the additive variance, or equivalently the genotype relative risk for
a heterozygote, for an allele with frequency 0.1 and purely additive variance. Analytic
approximations (solid circles), Eqs. 1 and 2, are indistinguishable from the exact results when
the genotype relative risk is smaller than a factor of 2. The disease incidence $r$ is 10% for the
affected-unaffected pools, and 27% of the population is selected for each of the tail pools.

Fig. 2b The frequency difference at the significance threshold is shown for the same 
parameters as panel a. This threshold determines the measurement accuracy required for an 
association test based on pooled DNA. 

Detailed Description of the Invention

The present invention provides analytic results for association tests. It is shown that the
analytic results closely approximate exact numerical calculations. The
invention further extends the analysis to qualitative phenotypes using a genotype relative risk
model.

A particular quantitative phenotype $X$ is standardized to have unit variance and zero mean.
The phenotype is hypothesized to be affected by alleles $A_1$ and $A_2$, with frequencies $p$ and $1-p$
respectively, at a particular QTL. The population fractions $P(G)$ for genotypes $G = A_1A_1$,
$A_1A_2$, and $A_2A_2$ are assumed to obey Hardy-Weinberg equilibrium. Using standard notation for a
variance components model (Falconer and MacKay, 1996), the effect $\mu_G$ of genotype $G$ on
phenotype $X$ is $a-\mu$ for $A_1A_1$, $d-\mu$ for $A_1A_2$, and $-a-\mu$ for $A_2A_2$. The constant
$\mu = (2p-1)a + 2p(1-p)d$ ensures that the mean of $X$ is zero. The ratio $d/a$ describes the inheritance mode
for allele $A_1$. Dominant, recessive, and additive inheritance are special cases with $d/a$ equal to
+1, -1, and 0, respectively.

The phenotypic variance due to the QTL may be partitioned into the additive variance $\sigma_A^2$ and
the dominance variance $\sigma_D^2$, with

$$\sigma_A^2 + \sigma_D^2 = 2pq\,[a - d(p-q)]^2 + 4p^2q^2d^2,$$

where $q = 1-p$. The additive variance is often much larger than the dominance variance even if the inheritance
mode is not purely additive. The exceptions are QTLs with recessive minor alleles and
dominant major alleles, which are difficult to detect in unselected populations. The
contribution of remaining genetic and environmental factors is assumed to follow a normal
distribution with residual variance $\sigma_R^2$,

$$\sigma_R^2 = 1 - (\sigma_A^2 + \sigma_D^2).$$

Of particular interest here are complex traits: the effect of any single QTL is small, $\sigma_A^2 + \sigma_D^2 < 0.05$,
and the residual variance $\sigma_R^2$ is nearly 1.
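The variance partition above can be checked numerically. The following is a minimal sketch, assuming illustrative parameter values (p = 0.1, a = 0.2, d = 0.1) that do not come from the specification; the function name `variance_components` is likewise hypothetical. It confirms that the centred genotype effects have zero mean and that their variance equals the sum of the additive and dominance variances.

```python
def variance_components(p, a, d):
    """Variance components for a biallelic QTL under Hardy-Weinberg equilibrium.

    p is the frequency of allele A1; a and d follow the Falconer-style
    notation used in the text. All numeric inputs here are illustrative.
    """
    q = 1.0 - p
    mu = (2 * p - 1) * a + 2 * p * q * d            # centring constant
    probs = {"A1A1": p * p, "A1A2": 2 * p * q, "A2A2": q * q}
    effects = {"A1A1": a - mu, "A1A2": d - mu, "A2A2": -a - mu}
    sigma_a2 = 2 * p * q * (a - d * (p - q)) ** 2   # additive variance
    sigma_d2 = 4 * p**2 * q**2 * d**2               # dominance variance
    sigma_r2 = 1.0 - (sigma_a2 + sigma_d2)          # residual variance
    return probs, effects, sigma_a2, sigma_d2, sigma_r2


probs, effects, sigma_a2, sigma_d2, sigma_r2 = variance_components(p=0.1, a=0.2, d=0.1)

# Mean and variance of the genotype effect, computed directly from P(G).
mean_effect = sum(probs[g] * effects[g] for g in probs)
genetic_var = sum(probs[g] * effects[g] ** 2 for g in probs)

print("mean effect:", mean_effect)
print("genetic variance:", genetic_var, "vs", sigma_a2 + sigma_d2)
print("residual variance:", sigma_r2)
```

For these inputs the genetic variance is well below 0.05, so the example falls in the complex-trait regime described in the text.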

A genotype relative risk model corresponds to classifying individuals as affected ($X > X_T$) or
unaffected ($X < X_T$) based on a specific threshold $X_T$. The proportion $r$ of the total population
that is affected is the overall risk or disease incidence; the probability that an individual with
genotype $G$ is affected, relative to the probability for an individual with genotype $A_2A_2$, is the
genotype relative risk. If the inheritance mode of $A_1$ is additive and $a$ is small compared to $\sigma_R$,
the relative risk is multiplicative with allele dose.



The sample size $N$ required to detect association between genotype $G$ and the quantitative
phenotype or the disease risk depends on the type I error rate $\alpha$, the type II error rate $\beta$, and the
test statistic and experimental design (Snedecor and Cochran, 1989), as well as on the
underlying genetic model. For a one-sided test of a single marker, $\alpha = 1 - \Phi(z_\alpha)$, where $\Phi(z)$
is the cumulative probability distribution for standard normal deviate $z$, defines $\alpha$ in terms of
deviate $z_\alpha$. Similarly, $1-\beta$ is the power to reject the null hypothesis and $z_{1-\beta} = \Phi^{-1}(\beta)$. For a
genome scan, the values $\alpha = 5\times10^{-8}$ ($z_\alpha = 5.33$) and $1-\beta = 0.8$ ($z_{1-\beta} = -0.84$) have been
suggested (Risch and Merikangas, 1996).
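The two normal deviates quoted for a genome scan can be reproduced with the Python standard library. This is a small sketch, not part of the original disclosure; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

alpha = 5e-8   # type I error rate suggested for a genome scan
power = 0.8    # 1 - beta

# alpha = 1 - Phi(z_alpha); by symmetry z_alpha = -Phi^{-1}(alpha),
# which avoids forming 1 - alpha in floating point.
z_alpha = -std_normal.inv_cdf(alpha)

# z_{1-beta} = Phi^{-1}(beta), which is negative when the power exceeds 0.5.
z_power = std_normal.inv_cdf(1.0 - power)

print(round(z_alpha, 2), round(z_power, 2))
```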

We consider two experimental designs using DNA pooled from individuals selected from a
sample of size $N$: affected-unaffected pools, with DNA pooled from $n$ affected and $n$
unaffected individuals; and tail pools, with DNA pooled from $n$ individuals at each tail of the
phenotype distribution. The test statistic for these designs is the frequency difference of the $A_1$
allele between the pools. The multinomial distribution describing the test statistic may be used
to calculate exactly the sample size required to achieve statistical significance at specified
power.

When the number of $A_1$ alleles summed over both pools is large, the distribution of the test
statistic is approximately normal. A significant association is detected if the allele frequency
difference between pools is at least $z_\alpha$ times the standard deviation of its estimator, or
$z_\alpha\, p^{1/2}(1-p)^{1/2}/n^{1/2}$. Furthermore, when the additive variance $\sigma_A^2$ is small and the residual variance
$\sigma_R^2$ is close to 1, convenient analytic approximations for the sample size requirements may be
derived.



For the affected-unaffected design, $n = rN$ of the individuals are expected to be diagnosed as
affected, and an additional $n$ matched controls are selected from the remainder of the
population. The analytic approximation for the sample size is

$$N_{cc} = [z_\alpha - z_{1-\beta}]^2 \, \frac{\sigma_R^2}{\sigma_A^2} \cdot \frac{2r(1-r)^2}{y^2} \left[ 1 + \frac{X_T\,(1-\sigma_R^2)^{1/2}}{2^{3/2}\,\sigma_R^2\, p^{1/2}(1-p)^{1/2}} \right]^{-2}. \qquad \text{(Eq. 1)}$$

The term $y$ is the height of the standard normal distribution at the normal deviate $X_T/\sigma_R$
corresponding to the threshold between affected and unaffected phenotypic values.

The tail pools are parameterized by the fraction $\rho = n/N$ of the population selected for each pool,
and $\rho$ plays a role analogous to the overall disease incidence $r$ in the affected-unaffected
design. The analytical approximation for the sample size is

$$N_{tail} = [z_\alpha - z_{1-\beta}]^2 \, \frac{\sigma_R^2}{\sigma_A^2} \cdot \frac{\rho}{2y^2}, \qquad \text{(Eq. 2)}$$

where $y$ is the height of the standard normal distribution for normal deviate $\Phi^{-1}(1-\rho)$. The
design may be optimized by selecting $\rho$ to minimize $N_{tail}$, which corresponds to minimizing
$\rho/y^2$. With this approximation, the optimal fraction is 0.27 and is independent of $\alpha$, $\beta$, and all
parameters of the genetic model.
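The optimal pooling fraction can be recovered with a simple grid search over the only factor in Eq. 2 that depends on the fraction selected for each pool (written `rho` below). This is an illustrative sketch; the grid resolution and the function name `tail_factor` are assumptions, and the source states only that standard optimization algorithms were used.

```python
from statistics import NormalDist

std_normal = NormalDist()


def tail_factor(rho):
    """Return rho / (2 y^2), the pooling-fraction-dependent factor of Eq. 2."""
    z = std_normal.inv_cdf(1.0 - rho)  # normal deviate at the upper threshold
    y = std_normal.pdf(z)              # height of the standard normal at z
    return rho / (2.0 * y * y)


# Coarse grid over pooling fractions from 1% to 49.9% in steps of 0.1%.
grid = [i / 1000 for i in range(10, 500)]
rho_opt = min(grid, key=tail_factor)

print("optimal fraction:", rho_opt)
print("factor at optimum:", round(tail_factor(rho_opt), 3))
```

Because the factor depends only on the pooling fraction, the location of the minimum does not change with the error rates or the genetic model, which is the independence property stated above.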

A third method, individual genotyping, serves as a baseline for evaluating the efficiency of the
two pooling-based methods. The sample size required to achieve significance using individual
genotyping is

$$N_{indiv} = [z_\alpha - z_{1-\beta}\,\sigma_R]^2 / \sigma_A^2, \qquad \text{(Eq. 3)}$$

based on a regression model of phenotypic value on allele dose.



Detailed Description of Analytical Methods 

The genotype-dependent phenotype distribution in the variance components model is

$$P(X|G) = (2\pi\sigma_R^2)^{-1/2} \exp[-(X-\mu_G)^2 / 2\sigma_R^2],$$

and the overall phenotype distribution is the sum of the three normal distributions,

$$P(X) = \sum_G P(X|G)\,P(G).$$

When an upper threshold $X_U$ is specified to select a fraction $\rho$ of the total population with
phenotypic values above the threshold, the equation

$$\rho = \sum_G \{1 - \Phi[(X_U - \mu_G)/\sigma_R]\}\, P(G)$$

may be solved numerically for $X_U$ as a function of $\rho$. The genotypes of individuals selected by
$X > X_U$ follow a multinomial distribution; the probability that an individual has genotype $G$ is
$\theta_U(G) = \{1 - \Phi[(X_U - \mu_G)/\sigma_R]\}\, P(G)/\rho$. A multinomial distribution is similarly defined using a
lower threshold $X_L$,

$$1 = \sum_G \theta_L(G) = \rho^{-1} \sum_G \Phi[(X_L - \mu_G)/\sigma_R]\, P(G).$$

For an affected-unaffected design, the fraction in the upper pool is $r$ and the fraction in the
lower pool is $1-r$, yielding $X_U = X_L = X_T$. The relative risk for genotype $G$ is
$[\theta_U(G)/P(G)] / [\theta_U(A_2A_2)/P(A_2A_2)]$.
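The threshold equation and the resulting genotype distribution in the upper pool can be sketched as follows. Bisection is used here as one possible root search (the source says only that standard algorithms were employed), and the parameter values (allele frequency 0.1, a = 0.2357, incidence 10%) and all function names are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def genotype_model(p, a, d=0.0):
    """Return P(G), centred effects mu_G, and sigma_R for A1A1, A1A2, A2A2."""
    q = 1.0 - p
    mu = (2 * p - 1) * a + 2 * p * q * d
    probs = [p * p, 2 * p * q, q * q]
    effects = [a - mu, d - mu, -a - mu]
    genetic_var = sum(P * m * m for P, m in zip(probs, effects))
    return probs, effects, (1.0 - genetic_var) ** 0.5


def fraction_above(x, probs, effects, sigma_r):
    """Fraction of the whole population with phenotype above x."""
    return sum(P * (1.0 - STD_NORMAL.cdf((x - m) / sigma_r))
               for P, m in zip(probs, effects))


def solve_upper_threshold(fraction, probs, effects, sigma_r):
    """Bisection for X_U; fraction_above decreases monotonically in x."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if fraction_above(mid, probs, effects, sigma_r) > fraction:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def upper_pool_distribution(fraction, probs, effects, sigma_r):
    """Return X_U and theta_U(G), the genotype probabilities in the upper pool."""
    x_u = solve_upper_threshold(fraction, probs, effects, sigma_r)
    theta = [P * (1.0 - STD_NORMAL.cdf((x_u - m) / sigma_r)) / fraction
             for P, m in zip(probs, effects)]
    return x_u, theta


# Affected-unaffected example: incidence r = 10%, so X_U = X_T.
probs, effects, sigma_r = genotype_model(p=0.1, a=0.2357)
x_t, theta_u = upper_pool_distribution(0.10, probs, effects, sigma_r)

# Relative risk of each genotype with respect to A2A2.
enrichment = [t / P for t, P in zip(theta_u, probs)]
relative_risk = [e / enrichment[2] for e in enrichment]
print("threshold:", round(x_t, 3))
print("relative risks (A1A1, A1A2, A2A2):", [round(rr, 2) for rr in relative_risk])
```

The lower-pool distribution follows the same pattern with the cumulative probability in place of its complement.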

Sample size requirements may be obtained directly from the multinomial distributions of
genotypes by exhaustively tabulating allele counts $C_U$ and $C_L$ in the upper and lower pools for
each distinct composition of genotypes among the $n$ selected individuals. The distribution
corresponding to the null hypothesis, $\theta(G) = P(G)$, is used to define the smallest threshold $\Delta C$
such that $C_U - C_L \ge \Delta C$ with probability $\alpha$ or less. The discrete allele count usually yields the
strict inequality. Next, the distributions under the alternative hypothesis are considered, and
the probability that $C_U - C_L \ge \Delta C$ is tabulated to provide the power. If the power is greater
than or equal to the specified $1-\beta$, the choice of $n$ and $N = n/\rho$ or $n/r$ is feasible. A search is
performed for the smallest feasible $N$ with $r$ or $\rho$ specified. For tail pools, $\rho$ is then varied to
find the overall optimum.



When the number of alleles summed over both pools is large, the allele frequency difference
follows a normal distribution. Under the null hypothesis, the mean is zero and the variance is
$\sigma_0^2/n = p(1-p)/n$. This result is derived by noting that the variance of the frequency difference
is twice the variance of the mean for a single pool of $n$ individuals. The allele frequency
variance for an individual is $p(1-p)/2$, and averaging over the $n$ individuals reduces the
variance by the factor $n$. Under the alternative hypothesis, the expected allele frequency
difference $\Delta p$ is

$$\Delta p = p_U - p_L = \sum_G [\theta_U(G) - \theta_L(G)]\, p_G,$$

where the genotype-dependent allele frequency $p_G$ is 1 for $G = A_1A_1$, 0.5 for $A_1A_2$, and 0 for
$A_2A_2$. The variance is $\sigma_1^2/n$, where $\sigma_1^2$ is obtained from the multinomial distribution (Beyer,
1984),

$$\sigma_1^2 = \sum_G [\theta_U(G) + \theta_L(G)]\, p_G^2 - (p_U^2 + p_L^2).$$

The number of individuals required per pool for type I error $\alpha$ and power $1-\beta$ is

$$n = [z_\alpha \sigma_0 - z_{1-\beta}\, \sigma_1]^2 / \Delta p^2.$$

For affected-unaffected pools, $n/r$ is the required sample size. For tail pools, $N = n/\rho$, and
$\rho$ is varied to find the smallest $N$.
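The normal-approximation calculation for tail pools can be sketched end to end and compared with the analytic Eq. 2. The example assumes purely additive inheritance, an allele frequency of 0.1, an additive variance of 0.01, and 27% of the population selected for each pool; the function names and the bisection root search are illustrative choices rather than the procedure used in the source.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tail_pool_sample_size(p, sigma_a2, rho, alpha=5e-8, power=0.8):
    """Normal-approximation sample size N for tail pools (additive QTL)."""
    q = 1.0 - p
    a = (sigma_a2 / (2 * p * q)) ** 0.5        # additive: sigma_A^2 = 2pq a^2
    mu = (2 * p - 1) * a
    probs = [p * p, 2 * p * q, q * q]          # A1A1, A1A2, A2A2
    effects = [a - mu, -mu, -a - mu]
    dose = [1.0, 0.5, 0.0]                     # genotype allele frequency p_G
    sigma_r = (1.0 - sigma_a2) ** 0.5

    def fraction_below(x):
        return sum(P * STD_NORMAL.cdf((x - m) / sigma_r)
                   for P, m in zip(probs, effects))

    def solve(target):                         # fraction_below increases in x
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if fraction_below(mid) < target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    x_l, x_u = solve(rho), solve(1.0 - rho)
    theta_l = [P * STD_NORMAL.cdf((x_l - m) / sigma_r) / rho
               for P, m in zip(probs, effects)]
    theta_u = [P * (1.0 - STD_NORMAL.cdf((x_u - m) / sigma_r)) / rho
               for P, m in zip(probs, effects)]

    p_u = sum(t * g for t, g in zip(theta_u, dose))
    p_l = sum(t * g for t, g in zip(theta_l, dose))
    delta_p = p_u - p_l
    sigma0 = (p * q) ** 0.5
    sigma1 = (sum((tu + tl) * g * g for tu, tl, g in zip(theta_u, theta_l, dose))
              - (p_u ** 2 + p_l ** 2)) ** 0.5

    z_alpha = -STD_NORMAL.inv_cdf(alpha)
    z_power = STD_NORMAL.inv_cdf(1.0 - power)
    n = (z_alpha * sigma0 - z_power * sigma1) ** 2 / delta_p ** 2
    return n / rho, delta_p


def tail_pool_analytic(sigma_a2, rho, alpha=5e-8, power=0.8):
    """Analytic approximation of Eq. 2."""
    z_alpha = -STD_NORMAL.inv_cdf(alpha)
    z_power = STD_NORMAL.inv_cdf(1.0 - power)
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - rho))
    return (z_alpha - z_power) ** 2 * ((1.0 - sigma_a2) / sigma_a2) * rho / (2 * y * y)


n_numeric, delta_p = tail_pool_sample_size(p=0.1, sigma_a2=0.01, rho=0.27)
n_analytic = tail_pool_analytic(sigma_a2=0.01, rho=0.27)
print("normal approximation N:", round(n_numeric))
print("Eq. 2 N:", round(n_analytic))
print("expected frequency difference:", round(delta_p, 4))
```

For a weak additive effect such as this one, the two estimates should agree to within a few percent.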



The normal approximation underestimates the sample size requirement relative to the exact
results from the multinomial distribution. When the sum of the alleles in both pools is at least
60, the difference in sample sizes is no greater than 5%. We chose 60 alleles in both pools as
the criterion for switching from the multinomial to the normal calculation. Standard algorithms
were employed to perform the root search for $X_U$ and $X_L$, the optimization, and the integration
over the tail of a normal distribution (Press, 1997).

The analytic results are obtained by setting $\sigma_1$ to $\sigma_0$ and expanding $\Delta p$ to second order in the
effect size $\mu_G$, corresponding loosely to a perturbation theory for probability distributions
(Chandler, 1987). From a Taylor series expansion,

$$\Phi(z-b) = \Phi(z) - b\,y - (1/2)\,b^2 z\,y,$$

where $y = (2\pi)^{-1/2} \exp(-z^2/2)$. Substituting this result into the expressions for $\theta(G)$ using
$b = \mu_G/\sigma_R$ and $z = X_U/\sigma_R = \Phi^{-1}(1-\rho)$, where $X_U$ is the threshold used to select the upper pool, yields for the
tail design

$$p_U = p + (y/\rho)\, E[(\mu_G/\sigma_R)\,p_G] + (y|z|/2\rho)\, E[(\mu_G/\sigma_R)^2 p_G] \quad \text{and}$$
$$p_L = p - (y/\rho)\, E[(\mu_G/\sigma_R)\,p_G] + (y|z|/2\rho)\, E[(\mu_G/\sigma_R)^2 p_G].$$

The corresponding results for the affected-unaffected pools, with $z = \Phi^{-1}(1-r)$, are

$$p_U = p + (y/r)\, E[(\mu_G/\sigma_R)\,p_G] + (y|z|/2r)\, E[(\mu_G/\sigma_R)^2 p_G] \quad \text{and}$$
$$p_L = p - [y/(1-r)]\, E[(\mu_G/\sigma_R)\,p_G] - [y|z|/2(1-r)]\, E[(\mu_G/\sigma_R)^2 p_G].$$

The required expectation values are

$$E[\mu_G p_G] = \sum_G P(G)\,\mu_G\, p_G = \sigma_A\,[p(1-p)/2]^{1/2}, \quad \text{and}$$
$$E[\mu_G^2 p_G] = \sum_G P(G)\,\mu_G^2\, p_G = (1/2)(1-\sigma_R^2) - 4p^2(1-p)^2 ad + (2p-1)\sigma_D^2/2 \approx \sigma_A^2/2.$$

The results for $\Delta p$,

$$\Delta p = 2^{1/2}\, y\, \sigma_0 \sigma_A / \rho\,\sigma_R, \quad \text{tail pools, and}$$
$$\Delta p = [1 + X_T\,\sigma_A / 2^{3/2}\sigma_0\sigma_R^2]\; y\,\sigma_0\sigma_A / 2^{1/2}\, r(1-r)\,\sigma_R, \quad \text{affected-unaffected pools,}$$

lead directly to Eqs. 1 and 2.
Approximate genotype relative risks may also be obtained from the Taylor series expansion
for $\theta(G)$. To lowest order, the relative risk for the heterozygote is approximately
$1 + (d+a)\,y/r\sigma_R$ and for the $A_1A_1$ homozygote is $1 + 2a\,y/r\sigma_R$. For additive inheritance, $d = 0$, and
the relative risk is multiplicative with allele dose when $a\,y/r\sigma_R$ is small. For a complex trait $\sigma_R$
is close to 1, and for a minor allele, $a \approx \sigma_A/(2p)^{1/2}$. When the disease incidence is 10%, the
parameter required to be small is $1.24\,\sigma_A/p^{1/2}$.

For individual genotyping, the regression model used to test significance is

$$X = b_1 (p_G - p) + \epsilon,$$

where the residual contribution $\epsilon$ to the phenotype has zero mean and is uncorrelated with $p_G$.
Using standard statistical methods (Snedecor, 1989), the test statistic $b_1$ under the null
hypothesis has mean zero and variance $\mathrm{Var}(b_1|\text{null})$ given by

$$\mathrm{Var}(b_1|\text{null}) = N^{-1}\, \mathrm{Var}(X)/\mathrm{Var}(p_G) = 1/N[p(1-p)/2].$$

Under the alternative hypothesis, the expectation for the test statistic is

$$E(b_1) = \mathrm{Cov}(X, p_G)/\mathrm{Var}(p_G) = \sigma_A\,[p(1-p)/2]^{-1/2},$$

and its variance is

$$\mathrm{Var}(b_1|\text{alt}) = N^{-1}\, \mathrm{Var}(\epsilon)/\mathrm{Var}(p_G) = \sigma_R^2/N[p(1-p)/2].$$

The sample size required for a one-sided test of $b_1$ with type I error $\alpha$ and power $1-\beta$ is

$$N = [z_\alpha \{N\,\mathrm{Var}(b_1|\text{null})\}^{1/2} - z_{1-\beta} \{N\,\mathrm{Var}(b_1|\text{alt})\}^{1/2}]^2 / E(b_1)^2,$$

where the factors of $N$ cancel the $N^{-1}$ in each variance, which is the result provided in Eq. 3.

Application of the methods of the invention 

The sample sizes required for the pooled DNA designs are compared in Fig. 1 to the sample
size $N_{indiv}$ required by individual genotyping. The ratio $N_{cc}/N_{indiv}$ (dashed line) is a function of
the disease incidence $r$, while $N_{tail}/N_{indiv}$ (solid line) is a function of the pooling fraction $\rho$. For
typical disease incidence, $r \approx 10\%$, the affected-unaffected design requires a sample 5.3x
larger than that required for individual genotyping. Compared to the tail design, it measures
an allele frequency difference that is half as large and is approximately 4x less efficient. The
tail design, with $\rho$ = 27%, requires a sample only 1.24x larger than required for individual
genotyping. The tail design is also robust to variation in $\rho$ near its optimum, as values from
19% to 37% drop the efficiency no more than 5%.
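The ratios quoted above can be checked against the leading-order forms of Eqs. 1 to 3. The sketch below assumes the weak-effect limit in which the residual variance is taken as 1 and the bracketed correction factor in Eq. 1 is dropped; both simplifications, and the function names, are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ratio_tail(rho):
    """Leading-order N_tail / N_indiv for pooling fraction rho."""
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - rho))
    return rho / (2.0 * y * y)


def ratio_affected_unaffected(r):
    """Leading-order N_cc / N_indiv for disease incidence r."""
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - r))
    return 2.0 * r * (1.0 - r) ** 2 / (y * y)


cc_ratio = ratio_affected_unaffected(0.10)
tail_ratio = ratio_tail(0.27)
relative_cost = cc_ratio / tail_ratio

# Efficiency of nearby pooling fractions relative to the 27% optimum.
efficiency_19 = tail_ratio / ratio_tail(0.19)
efficiency_37 = tail_ratio / ratio_tail(0.37)

print("affected-unaffected vs individual:", round(cc_ratio, 2))
print("tail pools vs individual:", round(tail_ratio, 2))
print("affected-unaffected vs tail pools:", round(relative_cost, 2))
print("efficiency at 19% and 37%:", round(efficiency_19, 3), round(efficiency_37, 3))
```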

The analytic theory indicates that the additive variance $\sigma_A^2$, or equivalently the genotype
relative risk for an allele of known frequency, is the most important factor determining the
sample size requirements. This dependence is shown in Fig. 2a with exact numerical results
for affected-unaffected pools (dashed line) and tail pools (solid line) for type I error of $5\times10^{-8}$
and power of 0.8. The minor allele frequency is 10%, its effect on the quantitative phenotype
is purely additive, and the disease incidence is 10%. The analytic approximations (solid
circles) from Eqs. 1 and 2 are nearly indistinguishable from the exact results when the genotype
relative risk drops below a factor of 2. As predicted by the analytic theory, the tail pools
require smaller sample sizes than the affected-unaffected pools, and the gap grows wider for
alleles with a smaller effect on the phenotype. For relative risks of 2 to 5, the deviations from
analytic theory are moderate; above a relative risk of 5, the phenotype is monogenic with
respect to locus G, and the analytic approximations for complex traits are no longer valid.

The allele frequency difference between pools at the significance threshold is shown in Fig. 2b
for affected-unaffected pools (dashed line) and tail pools (solid line). The measurement error
in the allele frequency difference must be smaller than the significance threshold to detect
association (Darvasi, 1994). Evaluations that provide a frequency difference measurement
accurate to 0.04 can detect association with alleles responsible for 1% of the total phenotypic
variance, corresponding to a heterozygote relative risk of 1.5. The allele frequency difference
measurement must be accurate to 0.01 to detect association with an allele explaining 0.1% of
the phenotypic variance, corresponding to a relative risk of 1.14.
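The required measurement accuracy can be estimated by combining the analytic tail-pool sample size with the significance threshold for the frequency difference (the type I deviate times the null standard deviation). The sketch below assumes tail pools with 27% selected per pool, an allele frequency of 0.1, and the genome-scan error rates; it uses the analytic approximation rather than the exact calculation behind Fig. 2b, so the values are indicative only, and the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tail_pool_threshold(sigma_a2, p, rho=0.27, alpha=5e-8, power=0.8):
    """Allele frequency difference at the significance threshold for tail pools."""
    z_alpha = -STD_NORMAL.inv_cdf(alpha)
    z_power = STD_NORMAL.inv_cdf(1.0 - power)
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - rho))
    sigma_r2 = 1.0 - sigma_a2                      # purely additive QTL assumed
    # Analytic sample size for tail pools, then individuals per pool.
    n_total = (z_alpha - z_power) ** 2 * (sigma_r2 / sigma_a2) * rho / (2 * y * y)
    n_pool = rho * n_total
    # Threshold = z_alpha times the null standard deviation of the difference.
    return z_alpha * (p * (1.0 - p) / n_pool) ** 0.5


threshold_1pct = tail_pool_threshold(sigma_a2=0.01, p=0.1)
threshold_01pct = tail_pool_threshold(sigma_a2=0.001, p=0.1)
print("threshold for 1% of variance:", round(threshold_1pct, 3))
print("threshold for 0.1% of variance:", round(threshold_01pct, 3))
```

Both values are of the same order as the accuracies of 0.04 and 0.01 quoted in the text.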

To test the range of validity of the analytic estimates for pooling, we performed a series of
exact calculations of sample size requirements as a function of $p$ and $d/a$. Large deviations
were seen only when the magnitude of a gene effect $\mu_G$ approached $\sigma_R$ in size, or,
equivalently, when $\sigma_A^2$ was larger than the minor allele frequency or when a genotype relative
risk was larger than 5 (results not shown). For additive contributions from a minor allele, the
range of validity corresponds to $\sigma_A^2 < 2p$.

The advantages of the methods disclosed herein include the following. The optimal fraction 
for tail pooling, 27%, is independent of all model parameters including allele frequency, 
inheritance mode, effect size, and type I error and power, for virtually any QTL contributing to 
a complex trait. The exceptions to this finding are rare QTLs with relative risks of 5 or 
greater, and rare, recessive alleles, both of which are more difficult to detect than more 
frequent alleles contributing to the same overall phenotypic variance. In addition, the tail 
design is approximately 4-fold more efficient than the affected-unaffected design and requires 
a sample size only 24% larger than for individual genotyping. Still further, DNA pooling 
studies designed according to the present procedures disclosed herein provide extremely 
efficient methods for large-scale screening and should help to make feasible genome-wide 
association studies. 
OTHER EMBODIMENTS

While the invention has been described in conjunction with the detailed description 
thereof, the foregoing description is intended to illustrate and not limit the scope of the 
invention, which is defined by the scope of the appended claims. Other aspects, advantages, 
and modifications are within the scope of the following claims. 